INN Hotels Project¶
Context¶
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
- Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
- Human resources to make arrangements for the guests.
Objective¶
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.
Data Description¶
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
- Booking_ID: unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of Children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer:
  - Not Selected – No meal plan selected
  - Meal Plan 1 – Breakfast
  - Meal Plan 2 – Half board (breakfast and one other meal)
  - Meal Plan 3 – Full board (breakfast, lunch, and dinner)
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Date of the month
- market_segment_type: Market segment designation.
- repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
- no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
- booking_status: Flag indicating if the booking was canceled or not.
Importing necessary libraries and data¶
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Model Tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
Import Dataset¶
# Loading data
hotel = pd.read_csv("INNHotelsGroup.csv")
Checking Shape of Data¶
# Checking shape of data
hotel.shape
(36275, 19)
- There are 36275 rows and 19 columns in the dataset.
Displaying sample from dataset¶
# Checking 10 samples in data
hotel.sample(10)
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23539 | INN23540 | 2 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 40 | 2018 | 8 | 28 | Online | 0 | 0 | 0 | 109.80 | 0 | Not_Canceled |
| 5737 | INN05738 | 2 | 0 | 2 | 5 | Meal Plan 1 | 0 | Room_Type 1 | 106 | 2018 | 5 | 8 | Online | 0 | 0 | 0 | 105.28 | 0 | Canceled |
| 27117 | INN27118 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 3 | 2017 | 8 | 14 | Online | 0 | 0 | 0 | 146.00 | 1 | Not_Canceled |
| 29106 | INN29107 | 3 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 4 | 1 | 2018 | 12 | 26 | Online | 0 | 0 | 0 | 161.00 | 1 | Not_Canceled |
| 34407 | INN34408 | 2 | 0 | 0 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 270 | 2018 | 4 | 20 | Offline | 0 | 0 | 0 | 62.80 | 0 | Canceled |
| 3496 | INN03497 | 2 | 0 | 2 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 36 | 2018 | 4 | 17 | Online | 0 | 0 | 0 | 100.80 | 0 | Canceled |
| 18209 | INN18210 | 2 | 0 | 0 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 16 | 2018 | 3 | 29 | Online | 0 | 0 | 0 | 141.00 | 1 | Not_Canceled |
| 23855 | INN23856 | 2 | 1 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 113.69 | 3 | Not_Canceled |
| 9501 | INN09502 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 238 | 2018 | 10 | 13 | Offline | 0 | 0 | 0 | 80.00 | 0 | Not_Canceled |
| 14868 | INN14869 | 2 | 0 | 0 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 133 | 2018 | 8 | 16 | Online | 0 | 0 | 0 | 124.80 | 0 | Canceled |
Checking Datatypes of columns¶
## Checking Datatypes of columns
hotel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
The dataset contains the following datatypes:¶
- Object - Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, booking_status.
- Int64 - no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, lead_time, arrival_year, arrival_month, arrival_date, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, no_of_special_requests.
- Float64 - avg_price_per_room.
Displaying statistical summary of data¶
# Checking for Statistical Summary of the data
# Creating two diff dataframes for numbers and objects
num = hotel.select_dtypes(include=np.number)
cat = hotel.select_dtypes('object')
# Checking stats for numbers
num.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
cat.describe().T
| | count | unique | top | freq |
|---|---|---|---|---|
| Booking_ID | 36275 | 36275 | INN00001 | 1 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 |
| market_segment_type | 36275 | 5 | Online | 23214 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 |
Observations:¶
- Number of Adults: Most bookings are for 2 adults (the 25th, 50th and 75th percentiles are all 2); the maximum is 4.
- Number of Children: Most bookings include 0 children, which may suggest that many bookings are made by couples. The maximum number of children is 10, which occurs in only one booking.
- Number of Weekend Nights: Most bookings include 0 to 2 weekend nights (median 1); the maximum is 7.
- Number of Week Nights: The middle 50% of bookings are for 1 to 3 week nights (median 2); the maximum is 17.
- Required Car Parking Space: About 97% of bookings did not require a parking space.
- Lead Time: The mean lead time is about 85 days (just under 3 months) and the median is 57 days, so the distribution is right-skewed. The minimum is 0 days, the maximum is 443 days, and 75% of bookings have a lead time of 126 days or less.
- Arrival Year: The data covers only two years, 2017 and 2018, with more bookings in 2018.
- Arrival Month: More bookings were in September and October.
- Repeated Guest: About 97% of bookings were made by guests who were not repeat guests.
- Number of Previous Cancellations and Number of Previous Bookings Not Canceled: Both are 0 for at least 75% of bookings; the maxima are 13 and 58.
- Average Price Per Room: The mean price is about 103 euros and the median is about 99 euros. The two are close, but the maximum of 540 indicates a long right tail. 75% of bookings are priced at 120 euros or less, and the range is 0 to 540.
- Number of Special Requests: At least half of the bookings had no special requests (median 0); the maximum is 5.
- Type of Meal Plan: There are 4 unique meal plan types. Meal Plan 1 (breakfast only) is the most requested.
- Room Type Reserved: There are 7 unique room types, and Room_Type 1 is the most requested.
- Market Segment Type: There are 5 market segments: Online, Offline, Corporate, Complementary and Aviation. Most bookings are made online.
- Booking Status: Most bookings (24,390 of 36,275) are not canceled.
# Checking for duplicates in the data
hotel.duplicated().sum()
0
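There are no fully duplicated rows in the data. Because Booking_ID is unique for every row, this check cannot flag bookings that are identical in every attribute other than the ID.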
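Since a unique identifier makes every row distinct, a duplicate check on the booking attributes alone can be more informative. The following is a minimal sketch on a hypothetical toy frame (the values and the reduced column set are made up for illustration, not taken from the dataset); it only shows how the two checks can differ.

```python
import pandas as pd

# Hypothetical toy frame: the first two bookings differ only in their ID
toy = pd.DataFrame(
    {
        "Booking_ID": ["INN00001", "INN00002", "INN00003"],
        "no_of_adults": [2, 2, 1],
        "lead_time": [40, 40, 3],
    }
)

# Full-row check: the unique ID makes every row distinct
full_row_duplicates = int(toy.duplicated().sum())

# Check on the booking attributes only, ignoring the ID column
attribute_duplicates = int(toy.drop(columns="Booking_ID").duplicated().sum())

print(full_row_duplicates, attribute_duplicates)
```

Whether attribute-level duplicates should be removed is a judgment call: two different guests can legitimately make identical bookings.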
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- What are the busiest months in the hotel?
- Which market segment do most of the guests come from?
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
- What percentage of bookings are canceled?
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
Univariate Analysis¶
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    # least frequent target class; the tables are sorted by this column
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
Number of Adults¶
# Barplot for number of adults
labeled_barplot(hotel,'no_of_adults',perc=True)
- Most Bookings were for 2 Adults.
Number of Children¶
# Barplot for number of children
labeled_barplot(hotel,'no_of_children')
- 92% of bookings were with 0 children.
Number of weekend nights¶
# Barplot for number of weekend nights
labeled_barplot(hotel,'no_of_weekend_nights',perc=True)
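- Bookings with no weekend nights are the most common single case, but they are not a majority: the median is 1 weekend night, so at least half of the bookings include one or more weekend nights.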
Number of week nights¶
# Barplot for number of week nights
labeled_barplot(hotel,'no_of_week_nights',perc=True)
- Most bookings were for 1-3 week nights.
Number of special requests¶
#Barplot Number of special requests
labeled_barplot(hotel,'no_of_special_requests',perc=True)
- 55% of bookings had no special requests.
Type of Meal Plan¶
#Barplot Type of meal plan
labeled_barplot(hotel,'type_of_meal_plan',perc=True)
- 76% of bookings requested Meal Plan 1 (breakfast only).
Room Type Reserved¶
##Barplot for room type reserved
labeled_barplot(hotel,'room_type_reserved',perc=True)
- 77% of bookings were for Room_Type 1.
Market Segment Type¶
#Barplot for market segment type
labeled_barplot(hotel,'market_segment_type',perc=True)
- 64% of bookings were made online.
Booking Status¶
#Barplot for booking status
labeled_barplot(hotel,'booking_status',perc=True)
- 67% of bookings were not canceled and 33% were canceled.
Arrival Month¶
#Barplot for Arrival Month
labeled_barplot(hotel,'arrival_month',perc=True)
- August, September and October are the busiest months for the hotel.
Arrival Year¶
#Barplot for Arrival Year
labeled_barplot(hotel,'arrival_year',perc=True)
- About 82% of bookings were for 2018 and about 18% for 2017, a gap of 64 percentage points.
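With only two arrival years in the data, the share of each year follows directly from the mean of arrival_year reported in the statistical summary. A minimal sketch of that arithmetic, using only the reported mean (the variable names are illustrative):

```python
# Mean of arrival_year taken from the statistical summary
mean_year = 2017.820427

# With only 2017 and 2018 present, mean = 2017 + (share of 2018 bookings)
share_2018 = mean_year - 2017
share_2017 = 1 - share_2018

# Difference between the two shares, in percentage points
gap_points = (share_2018 - share_2017) * 100

# How many 2018 bookings there are per 2017 booking
ratio = share_2018 / share_2017

print(f"2018: {share_2018:.1%}, 2017: {share_2017:.1%}")
print(f"gap: {gap_points:.0f} percentage points, ratio: {ratio:.1f}x")
```

The distinction matters when reading the bar plot: a 64-point gap between the shares is not the same as 64% more bookings.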
Repeated Guests¶
# Barplot for repeated guest
labeled_barplot(hotel,'repeated_guest',perc=True)
- 97% of bookings were made by new guests; only 2.6% were made by repeat guests.
Required Car parking space¶
#Barplot for required car parking space.
labeled_barplot(hotel,'required_car_parking_space',perc=True)
- 97% of bookings did not require a car parking space; only 3.1% requested one.
Lead Time¶
#Boxplot for Lead time
histogram_boxplot(hotel,'lead_time')
- There are outliers in the lead time. The mean lead time is 85 days and the median is 57 days, so the distribution is right-skewed.
Number of previous cancellations¶
#Boxplot for previous cancellations
histogram_boxplot(hotel,'no_of_previous_cancellations')
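- Most customers have no previous cancellations (the 75th percentile is 0; the maximum is 13).
Number of previous bookings not canceled¶
#Boxplot for previous bookings not canceled
histogram_boxplot(hotel,'no_of_previous_bookings_not_canceled')
- Most customers have no previous non-canceled bookings (the 75th percentile is 0; the maximum is 58), which is consistent with 97% of bookings coming from new guests.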
Average Price Per Room¶
#Boxplot for Average price per room
histogram_boxplot(hotel,'avg_price_per_room')
(hotel['avg_price_per_room']==0).sum()
545
- There are many outliers in the data. The mean and median prices are close. The average price is 0 for 545 bookings, which may be complimentary stays.
hotel[hotel['avg_price_per_room']==0].market_segment_type.value_counts().reset_index()
| | market_segment_type | count |
|---|---|---|
| 0 | Complementary | 354 |
| 1 | Online | 191 |
- The average price is 0 for 354 bookings in the Complementary segment and 191 in the Online segment.
# Calculating the 25th quantile
Q1 = hotel["avg_price_per_room"].quantile(0.25)
# Calculating the 75th quantile
Q3 = hotel["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
# capping only the extreme values (price >= 500) at the upper whisker; other values above the whisker are left unchanged
hotel.loc[hotel["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
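The upper whisker can be worked out by hand from the quartiles in the statistical summary (25% = 80.3, 75% = 120.0). The sketch below does that arithmetic and applies the same rule as the cell above to a few hypothetical prices, to show that the hard cutoff of 500 and the whisker are two different thresholds.

```python
# Quartiles of avg_price_per_room, taken from the statistical summary
q1, q3 = 80.3, 120.0

iqr = q3 - q1                      # interquartile range, about 39.7
upper_whisker = q3 + 1.5 * iqr     # about 179.55

# Same rule as the notebook cell: only prices >= 500 are replaced
hard_cutoff = 500
prices = [99.45, 179.55, 250.0, 540.0]   # hypothetical sample values
capped = [upper_whisker if p >= hard_cutoff else p for p in prices]

print(round(upper_whisker, 2), capped)
```

A price such as 250 lies above the whisker but below the cutoff, so it stays as it is; only the 540 value is pulled down to the whisker.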
Let's encode booking status: Canceled as 1 and Not_Canceled as 0.
hotel["booking_status"] = hotel["booking_status"].apply(lambda x: 1 if x =="Canceled" else 0)
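The lambda above sends every label other than "Canceled" to 0. That is fine here because booking_status has exactly two unique values, but on messier data an explicit mapping would surface unexpected labels as missing values instead of silently encoding them as 0. A minimal sketch of that alternative on a hypothetical series (the misspelled label is deliberate):

```python
import pandas as pd

# Hypothetical labels; the last one is a deliberate typo
status = pd.Series(["Canceled", "Not_Canceled", "Canceled", "not_canceled"])

# Explicit mapping: labels missing from the dictionary become NaN
encoded = status.map({"Canceled": 1, "Not_Canceled": 0})

# Number of labels that did not match either expected value
unexpected = int(encoded.isna().sum())

print(encoded.tolist(), unexpected)
```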
Bivariate Analysis¶
##Correlation heat map
plt.figure(figsize=(15, 5))
num = hotel.select_dtypes(include=np.number)
sns.heatmap(num.corr(),annot=True,cmap='coolwarm')
plt.show()
- There is no strong correlation but:
- Lead Time has positive correlation with booking status.
- Average price per room is slightly correlated with number of adults and number of children.
- Number of special requests is slightly negatively correlated with booking status.
Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
#Repeated guest and booking status
stacked_barplot(hotel,'repeated_guest','booking_status')
| repeated_guest | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 23476 | 11869 | 35345 |
| 1 | 914 | 16 | 930 |
- Repeat guests rarely cancel: 16 of 930 (1.7%) canceled, compared with 11,869 of 35,345 (33.6%) new guests.
Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
stacked_barplot(hotel,'no_of_special_requests','booking_status')
| no_of_special_requests | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 11232 | 8545 | 19777 |
| 1 | 8670 | 2703 | 11373 |
| 2 | 3727 | 637 | 4364 |
| 3 | 675 | 0 | 675 |
| 4 | 78 | 0 | 78 |
| 5 | 8 | 0 | 8 |
- Cancellation rates fall as the number of special requests rises: 43.2% with no requests, 23.8% with one, 14.6% with two, and no cancellations among bookings with three or more.
Room Type Reserved and Booking Status
stacked_barplot(hotel,'room_type_reserved','booking_status')
| room_type_reserved | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| Room_Type 1 | 19058 | 9072 | 28130 |
| Room_Type 4 | 3988 | 2069 | 6057 |
| Room_Type 6 | 560 | 406 | 966 |
| Room_Type 2 | 464 | 228 | 692 |
| Room_Type 5 | 193 | 72 | 265 |
| Room_Type 7 | 122 | 36 | 158 |
| Room_Type 3 | 5 | 2 | 7 |
- Cancellation rates are similar across most room types (roughly 23% to 34%), but noticeably higher for Room_Type 6 (406 of 966, about 42%).
Booking status across various market segment type
stacked_barplot(hotel,'market_segment_type','booking_status')
| market_segment_type | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| Online | 14739 | 8475 | 23214 |
| Offline | 7375 | 3153 | 10528 |
| Corporate | 1797 | 220 | 2017 |
| Aviation | 88 | 37 | 125 |
| Complementary | 391 | 0 | 391 |
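- The Online segment has both the most cancellations (8,475) and the highest cancellation rate (36.5%), followed by Offline (29.9%) and Aviation (29.6%). Corporate bookings are canceled far less often (10.9%), and no Complementary booking was canceled.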
Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
plt.figure(figsize=(15, 5))
sns.boxplot(data=hotel,x='market_segment_type',y='avg_price_per_room',hue='market_segment_type',showfliers=False)
plt.xlabel('Market Segment Type')
plt.ylabel('Average Price per Room')
plt.title('Average Price per Room by Market Segment Type')
plt.show()
- The average price per room is higher in the Online market segment than in the other segments.
Average Price by Room type Reserved
plt.figure(figsize=(15, 5))
sns.boxplot(data=hotel,x='room_type_reserved',y='avg_price_per_room',hue='room_type_reserved',showfliers=False)
plt.xlabel('Room Type Reserved')
plt.ylabel('Average Price per Room')
plt.title('Average Price per Room by Room Type Reserved')
plt.show()
- The price is a little higher for Room_Type 6 and Room_Type 7. Because the room type values are ciphered by INN Hotels, we do not know what these room types actually are.
Average Price per room by Number of special Requests
plt.figure(figsize=(15, 5))
sns.boxplot(data=hotel,x='no_of_special_requests',y='avg_price_per_room',hue='no_of_special_requests',showfliers=False)
plt.xlabel('Number of special requests')
plt.ylabel('Average Price per Room')
plt.title('Average Price per Room by Number of special requests')
plt.show()
- The room price does not differ much with the number of special requests, although it is slightly lower for bookings with 0 requests.
Lead Time and Booking Status
distribution_plot_wrt_target(hotel,'lead_time','booking_status')
- Bookings with a longer lead time are canceled more often, which matches the positive correlation seen in the heat map.
Average Price per room and booking status
distribution_plot_wrt_target(hotel,'avg_price_per_room','booking_status')
- The average price per room does not differ much by booking status, but it is slightly higher for canceled bookings.
Busiest Months for the hotel
# Grouping data by arrival months and counting bookings
monthly_data = hotel.groupby(["arrival_month"])["booking_status"].count()
# Creating a DataFrame with months and booking counts
# (the count is the number of bookings per month, not the number of guests)
monthly_data = pd.DataFrame(
    {"Month": list(monthly_data.index), "Bookings": list(monthly_data.values)}
)
# Plotting the trend over months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Bookings")
plt.title("Monthly Trend of Bookings")
plt.xlabel("Month")
plt.ylabel("Number of Bookings")
plt.show()
- October (5,317 bookings), September (4,611) and August (3,813) are the busiest months for the hotel.
Average Price for various months
plt.figure(figsize=(10,5))
sns.lineplot(data=hotel,x='arrival_month',y='avg_price_per_room')
plt.xlabel('Arrival Month')
plt.ylabel('Average Price per Room')
plt.title('Average Price per Room by Arrival Month')
plt.show()
- Prices are higher from May through September, possibly because of school-holiday demand, and start to decline in October.
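Analysis for bookings that include at least one week night and at least one weekend night
# .copy() gives an independent DataFrame, so adding total_days below does not trigger SettingWithCopyWarning
stay = hotel[(hotel["no_of_week_nights"] > 0) & (hotel["no_of_weekend_nights"] > 0)].copy()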
stay.shape
(17094, 19)
stay["total_days"] = (
stay["no_of_week_nights"] + stay["no_of_weekend_nights"]
)
stacked_barplot(stay,'total_days','booking_status')
| total_days | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 10979 | 6115 | 17094 |
| 3 | 3689 | 2183 | 5872 |
| 4 | 2977 | 1387 | 4364 |
| 5 | 1593 | 738 | 2331 |
| 2 | 1301 | 639 | 1940 |
| 6 | 566 | 465 | 1031 |
| 7 | 590 | 383 | 973 |
| 8 | 100 | 79 | 179 |
| 10 | 51 | 58 | 109 |
| 9 | 58 | 53 | 111 |
| 14 | 5 | 27 | 32 |
| 15 | 5 | 26 | 31 |
| 13 | 3 | 15 | 18 |
| 12 | 9 | 15 | 24 |
| 11 | 24 | 15 | 39 |
| 20 | 3 | 8 | 11 |
| 19 | 1 | 5 | 6 |
| 16 | 1 | 5 | 6 |
| 17 | 1 | 4 | 5 |
| 18 | 0 | 3 | 3 |
| 21 | 1 | 3 | 4 |
| 22 | 0 | 2 | 2 |
| 23 | 1 | 1 | 2 |
| 24 | 0 | 1 | 1 |
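- Longer stays are more likely to be canceled: the cancellation rate is roughly 32% to 37% for stays of 2 to 5 nights, about 45% for 6 nights, and above 50% for most stays of 10 nights or more, although the long-stay groups are small.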
Percentage of bookings cancelled each month
stacked_barplot(hotel,'arrival_month','booking_status')
| arrival_month | booking_status 0 | booking_status 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 10 | 3437 | 1880 | 5317 |
| 9 | 3073 | 1538 | 4611 |
| 8 | 2325 | 1488 | 3813 |
| 7 | 1606 | 1314 | 2920 |
| 6 | 1912 | 1291 | 3203 |
| 4 | 1741 | 995 | 2736 |
| 5 | 1650 | 948 | 2598 |
| 11 | 2105 | 875 | 2980 |
| 3 | 1658 | 700 | 2358 |
| 2 | 1274 | 430 | 1704 |
| 12 | 2619 | 402 | 3021 |
| 1 | 990 | 24 | 1014 |
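- The cancellation rate is highest in July (45.0%) and June (40.3%). October has the most cancellations in absolute terms (1,880) but also the most bookings. January (2.4%) and December (13.3%) have the lowest cancellation rates.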
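The printed table is sorted by the number of cancellations, while the stacked bar chart shows the share of cancellations, and the two rankings differ. A small sketch that recomputes both from the counts in the output above (the dictionary is transcribed by hand and the variable names are illustrative):

```python
# (not canceled, canceled) counts per arrival month, copied from the printed crosstab
monthly_counts = {
    1: (990, 24), 2: (1274, 430), 3: (1658, 700), 4: (1741, 995),
    5: (1650, 948), 6: (1912, 1291), 7: (1606, 1314), 8: (2325, 1488),
    9: (3073, 1538), 10: (3437, 1880), 11: (2105, 875), 12: (2619, 402),
}

# Cancellation rate = canceled / all bookings for that month
cancel_rate = {
    month: canceled / (kept + canceled)
    for month, (kept, canceled) in monthly_counts.items()
}

# Month with the most cancellations in absolute terms
most_cancellations = max(monthly_counts, key=lambda m: monthly_counts[m][1])

# Months with the highest and lowest cancellation rate
highest_rate = max(cancel_rate, key=cancel_rate.get)
lowest_rate = min(cancel_rate, key=cancel_rate.get)

for month in sorted(cancel_rate, key=cancel_rate.get, reverse=True):
    print(f"month {month:>2}: {cancel_rate[month]:.1%}")
```

Ranking by rate rather than by count is usually the more useful view for policy decisions, since busy months produce many cancellations simply by having many bookings.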
Data Preprocessing¶
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
##Checking for missing values
hotel.isnull().sum()
| | 0 |
|---|---|
| Booking_ID | 0 |
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
- There are no missing values.
Booking_ID has a unique value for every row and will not help in modeling, so it is dropped.¶
df = hotel.copy()
df.drop('Booking_ID',axis=1,inplace=True)
Outlier Detection¶
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 10))
# one boxplot per numeric column, laid out on a 4 x 4 grid
for i, feature in enumerate(numerical_features):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[feature], whis=1.5)
    plt.title(feature)
plt.tight_layout()
plt.show()
Model Building¶
Model Evaluation Criteria:¶
The model can make two kinds of wrong predictions:¶
- Predicting that a booking will not be canceled when in reality it is canceled (false negative).
- Predicting that a booking will be canceled when in reality it is not canceled (false positive).
Which case is more important?¶
- False negatives: the booking is canceled but the model predicts no cancellation. The hotel loses resources and revenue because it bears the cost of the room staying vacant.
- False positives: the booking is not canceled but the model predicts cancellation. This can lead to customer dissatisfaction and put the brand's reputation at stake.
Both false negatives and false positives need to be minimized. The F1 score is the harmonic mean of recall and precision, so the goal is to maximize F1.
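To make the choice of F1 concrete, the sketch below computes the four metrics from raw confusion-matrix counts. The helper `classification_scores` is a hypothetical name introduced here, and returning 0.0 when a ratio is undefined is an assumption of this sketch. The baseline uses the class counts from the dataset (24,390 not canceled, 11,885 canceled); the second set of counts is made up for illustration.

```python
def classification_scores(tp, fp, fn, tn):
    """Return (accuracy, recall, precision, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumption of this sketch: undefined ratios are reported as 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, recall, precision, f1


# Baseline that predicts "not canceled" for every booking in the dataset
baseline = classification_scores(tp=0, fp=0, fn=11885, tn=24390)

# Hypothetical confusion counts, for illustration only
example = classification_scores(tp=8, fp=2, fn=4, tn=86)

print("baseline:", [round(v, 3) for v in baseline])
print("example: ", [round(v, 3) for v in example])
```

The baseline reaches about 67% accuracy without catching a single cancellation, which is why accuracy alone is a weak criterion for this problem and F1 is preferred.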
First, let's create functions to calculate different metrics and confusion matrix.¶
- The model_performance_classification_statsmodels function will be used to check the performance of the models.
- The confusion_matrix_statsmodels function will be used to plot the confusion matrix.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # converting predicted probabilities to classes: 1 if above the threshold, else 0
    pred = (model.predict(predictors) > threshold).astype(int)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
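Both helper functions turn predicted probabilities into classes by comparing them with a threshold. The sketch below illustrates, on hypothetical probabilities and labels, how lowering the threshold trades precision for recall; `recall_precision` is an illustrative helper, not part of the notebook.

```python
# Hypothetical predicted probabilities of cancellation and the true labels
probabilities = [0.92, 0.71, 0.55, 0.48, 0.35, 0.10]
actual = [1, 1, 0, 1, 0, 0]


def recall_precision(threshold):
    """Classify as 1 when probability > threshold, then score the result."""
    predicted = [int(p > threshold) for p in probabilities]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


default_cut = recall_precision(0.5)   # the default used by the helpers above
lower_cut = recall_precision(0.3)     # a more aggressive cutoff

print("threshold 0.5:", default_cut)
print("threshold 0.3:", lower_cut)
```

On this toy data the lower threshold catches every cancellation, at the cost of flagging more bookings that are not actually canceled.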
Building a Logistic Regression model¶
Creating training and test sets¶
- Encoding categorical features
- Splitting data into training and test sets
# specifying the independent and dependent variables
X = df.drop('booking_status',axis=1)
Y = df['booking_status']
# adding a constant to the independent variables
X = sm.add_constant(X)
# creating dummy variables
X = pd.get_dummies(X,drop_first=True)
# splitting data in train and test sets
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.30,random_state=1)
print("Shape of Training Set :",X_train.shape)
print("Shape of Test Set :",X_test.shape)
print("Percentage of Classes in Training set :")
print(y_train.value_counts(normalize=True))
print("Percentage of Classes in Test set :")
print(y_test.value_counts(normalize=True))
Shape of Training Set : (25392, 28)
Shape of Test Set : (10883, 28)
Percentage of Classes in Training set :
booking_status
0    0.670644
1    0.329356
Name: proportion, dtype: float64
Percentage of Classes in Test set :
booking_status
0    0.676376
1    0.323624
Name: proportion, dtype: float64
Building Logistic Regression Model¶
##Fitting Logistic Regression model
logit = sm.Logit(y_train,X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Fri, 06 Dec 2024 Pseudo R-squ.: 0.3292
Time: 22:47:55 Log-Likelihood: -10794.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.1330 120.816 -7.649 0.000 -1160.928 -687.338
no_of_adults 0.1136 0.038 3.020 0.003 0.040 0.187
no_of_children 0.1561 0.057 2.727 0.006 0.044 0.268
no_of_weekend_nights 0.1067 0.020 5.394 0.000 0.068 0.145
no_of_week_nights 0.0398 0.012 3.236 0.001 0.016 0.064
required_car_parking_space -1.5942 0.138 -11.564 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.866 0.000 0.015 0.016
arrival_year 0.4567 0.060 7.629 0.000 0.339 0.574
arrival_month -0.0416 0.006 -6.436 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.264 0.792 -0.003 0.004
repeated_guest -2.3465 0.617 -3.805 0.000 -3.555 -1.138
no_of_previous_cancellations 0.2663 0.086 3.108 0.002 0.098 0.434
no_of_previous_bookings_not_canceled -0.1727 0.153 -1.131 0.258 -0.472 0.127
avg_price_per_room 0.0188 0.001 25.444 0.000 0.017 0.020
no_of_special_requests -1.4690 0.030 -48.791 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1756 0.067 2.636 0.008 0.045 0.306
type_of_meal_plan_Meal Plan 3 17.4845 4247.878 0.004 0.997 -8308.203 8343.172
type_of_meal_plan_Not Selected 0.2783 0.053 5.247 0.000 0.174 0.382
room_type_reserved_Room_Type 2 -0.3607 0.131 -2.758 0.006 -0.617 -0.104
room_type_reserved_Room_Type 3 -0.0014 1.310 -0.001 0.999 -2.569 2.566
room_type_reserved_Room_Type 4 -0.2829 0.053 -5.320 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7187 0.209 -3.437 0.001 -1.128 -0.309
room_type_reserved_Room_Type 6 -0.9474 0.147 -6.447 0.000 -1.235 -0.659
room_type_reserved_Room_Type 7 -1.3992 0.293 -4.776 0.000 -1.973 -0.825
market_segment_type_Complementary -40.9223 6.23e+05 -6.56e-05 1.000 -1.22e+06 1.22e+06
market_segment_type_Corporate -1.1940 0.266 -4.489 0.000 -1.715 -0.673
market_segment_type_Offline -2.1948 0.255 -8.622 0.000 -2.694 -1.696
market_segment_type_Online -0.3996 0.251 -1.590 0.112 -0.892 0.093
========================================================================================================
Observations
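A negative coefficient means that the probability of a hotel booking being cancelled decreases as the corresponding attribute value increases.
A positive coefficient means that the probability of a hotel booking being cancelled increases as the corresponding attribute value increases.
Note that this first fit reports converged: False, and the coefficients for type_of_meal_plan_Meal Plan 3 and market_segment_type_Complementary have extremely large standard errors, so those two estimates should not be interpreted.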
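A minimal sketch of why the sign matters. It uses the lead_time and no_of_special_requests coefficients from the summary above; the baseline log-odds of -1.0 is a made-up value for illustration and is not taken from the model.

```python
import math

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical baseline log-odds for some booking (assumption, for illustration only).
baseline_log_odds = -1.0

# Coefficients copied from the summary above.
coef_lead_time = 0.0157
coef_special_requests = -1.4690

p_base = sigmoid(baseline_log_odds)
# Positive coefficient: 30 more days of lead time raises the probability.
p_more_lead = sigmoid(baseline_log_odds + coef_lead_time * 30)
# Negative coefficient: one more special request lowers the probability.
p_one_request = sigmoid(baseline_log_odds + coef_special_requests)

print(round(p_base, 3), round(p_more_lead, 3), round(p_one_request, 3))
```

The size of the change in probability depends on the baseline, which is why the coefficients are usually interpreted on the odds scale instead.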
Model performance evaluation¶
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train.astype(float), y_train)
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805963 | 0.634103 | 0.739609 | 0.682804 |
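- On the training set the model reaches about 81% accuracy and 74% precision, but recall is only 63%, so more than a third of the actual cancellations are missed at the default threshold.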
Let's see the confusion matrix on the training set¶
confusion_matrix_statsmodels(lg,X_train.astype(float),y_train)
The confusion matrix¶
- True Positives (TP): Customer cancelled booking and model predicted cancellation.
- True Negatives (TN): Customer did not cancel and model predicted same.
- False Positives (FP): Customer did not cancel but model predicted cancellation.
- False Negatives (FN): Customer cancelled the booking but model did not predict the cancellation.
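The four cells defined above can be counted directly, and accuracy, recall, precision and F1 all follow from them. The sketch below uses made-up toy labels (1 = cancelled) and hypothetical helper names; it is not part of the original notebook.

```python
import numpy as np

def confusion_cells(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means 'cancelled'."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # cancelled, predicted cancelled
    tn = int(np.sum(~y_true & ~y_pred))  # not cancelled, predicted not cancelled
    fp = int(np.sum(~y_true & y_pred))   # not cancelled, predicted cancelled
    fn = int(np.sum(y_true & ~y_pred))   # cancelled, predicted not cancelled
    return tp, tn, fp, fn

def metrics_from_cells(tp, tn, fp, fn):
    """Derive the four reported metrics from the confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}

# Toy labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cells = confusion_cells(y_true, y_pred)
scores = metrics_from_cells(*cells)
print(cells, scores)
```

For a hotel, a false negative is an unexpected empty room, while a false positive risks overbooking or an unnecessary intervention, so recall and precision both matter.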
But to make interpretations from the model, first we will have to remove multicollinearity from the data to get reliable coefficients and p-values.¶
- There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).
Checking for Multicollinearity¶
## function to check VIF
def vif_check(predictors):
    vif = pd.DataFrame()
    vif['features'] = predictors.columns
    # calculating VIF for each feature
    vif['vif'] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif


vif_check(X_train.astype(float))
| | features | vif |
|---|---|---|
| 0 | const | 3.949119e+07 |
| 1 | no_of_adults | 1.348486e+00 |
| 2 | no_of_children | 1.978622e+00 |
| 3 | no_of_weekend_nights | 1.069487e+00 |
| 4 | no_of_week_nights | 1.095670e+00 |
| 5 | required_car_parking_space | 1.039977e+00 |
| 6 | lead_time | 1.395178e+00 |
| 7 | arrival_year | 1.431668e+00 |
| 8 | arrival_month | 1.276373e+00 |
| 9 | arrival_date | 1.006735e+00 |
| 10 | repeated_guest | 1.783612e+00 |
| 11 | no_of_previous_cancellations | 1.395688e+00 |
| 12 | no_of_previous_bookings_not_canceled | 1.651996e+00 |
| 13 | avg_price_per_room | 2.064208e+00 |
| 14 | no_of_special_requests | 1.247302e+00 |
| 15 | type_of_meal_plan_Meal Plan 2 | 1.273250e+00 |
| 16 | type_of_meal_plan_Meal Plan 3 | 1.025217e+00 |
| 17 | type_of_meal_plan_Not Selected | 1.272519e+00 |
| 18 | room_type_reserved_Room_Type 2 | 1.101505e+00 |
| 19 | room_type_reserved_Room_Type 3 | 1.003303e+00 |
| 20 | room_type_reserved_Room_Type 4 | 1.362614e+00 |
| 21 | room_type_reserved_Room_Type 5 | 1.027970e+00 |
| 22 | room_type_reserved_Room_Type 6 | 1.974902e+00 |
| 23 | room_type_reserved_Room_Type 7 | 1.115594e+00 |
| 24 | market_segment_type_Complementary | 4.502286e+00 |
| 25 | market_segment_type_Corporate | 1.692846e+01 |
| 26 | market_segment_type_Offline | 6.411425e+01 |
| 27 | market_segment_type_Online | 7.117686e+01 |
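- Only the market_segment_type dummy variables (Corporate, Offline and Online) have high VIF values (above 10). Dummy variables created from the same categorical feature are expected to be correlated with one another, so we will ignore them. The very large value for const is not a concern, since the intercept is not a predictor.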
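To illustrate why dummy variables from one categorical feature can show a high VIF, here is a rough sketch that computes VIF as 1 / (1 - R²) with NumPy. The data are synthetic (not the hotel dataset), the helper name vif_manual is hypothetical, and the 2% reference level is an assumption chosen to mimic a rare dropped category.

```python
import numpy as np

def vif_manual(X):
    """Return the VIF of each column of X (2-D array without a constant column)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all remaining columns.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: one independent numeric feature and a three-level
# categorical whose dropped reference level is rare (2% of rows).
rng = np.random.default_rng(0)
segment = rng.choice(3, size=5000, p=[0.02, 0.33, 0.65])
numeric = rng.normal(size=5000)
dummy_offline = (segment == 1).astype(float)
dummy_online = (segment == 2).astype(float)

vifs = vif_manual(np.column_stack([numeric, dummy_offline, dummy_online]))
print(dict(zip(["numeric", "dummy_offline", "dummy_online"], vifs.round(2))))
```

When the reference level is rare, the remaining dummies almost sum to one, so each is nearly a linear function of the others and its VIF is inflated even though nothing is wrong with the feature itself.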
Dropping high p-value variables¶
- We will drop the predictor variables having a p-value greater than 0.05 as they do not significantly impact the target variable.
- But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
- Instead, we will do the following:
- Build a model, check the p-values of the variables, and drop the column with the highest p-value.
- Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
- Repeat the above two steps till there are no columns with p-value > 0.05.
The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.
# initial list of columns
cols = X_train.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    X_train_aux = X_train[cols]

    # fitting the model
    model = sm.Logit(y_train, X_train_aux.astype(float)).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train,X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 06 Dec 2024 Pseudo R-squ.: 0.3282
Time: 22:48:01 Log-Likelihood: -10810.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -916.8647 120.456 -7.612 0.000 -1152.953 -680.776
no_of_adults 0.1087 0.037 2.916 0.004 0.036 0.182
no_of_children 0.1520 0.057 2.656 0.008 0.040 0.264
no_of_weekend_nights 0.1086 0.020 5.497 0.000 0.070 0.147
no_of_week_nights 0.0417 0.012 3.400 0.001 0.018 0.066
required_car_parking_space -1.5947 0.138 -11.564 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.216 0.000 0.015 0.016
arrival_year 0.4529 0.060 7.587 0.000 0.336 0.570
arrival_month -0.0425 0.006 -6.586 0.000 -0.055 -0.030
repeated_guest -2.7358 0.557 -4.914 0.000 -3.827 -1.645
no_of_previous_cancellations 0.2288 0.077 2.982 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.382 0.000 0.018 0.021
no_of_special_requests -1.4700 0.030 -48.893 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1643 0.067 2.470 0.014 0.034 0.295
type_of_meal_plan_Not Selected 0.2860 0.053 5.407 0.000 0.182 0.390
room_type_reserved_Room_Type 2 -0.3557 0.131 -2.722 0.006 -0.612 -0.100
room_type_reserved_Room_Type 4 -0.2833 0.053 -5.344 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7362 0.208 -3.534 0.000 -1.144 -0.328
room_type_reserved_Room_Type 6 -0.9667 0.147 -6.584 0.000 -1.254 -0.679
room_type_reserved_Room_Type 7 -1.4338 0.293 -4.901 0.000 -2.007 -0.860
market_segment_type_Corporate -0.7927 0.103 -7.710 0.000 -0.994 -0.591
market_segment_type_Offline -1.7855 0.052 -34.377 0.000 -1.887 -1.684
==================================================================================================
#Checking Model Performance
model_performance_classification_statsmodels(lg1,X_train1.astype(float),y_train)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805372 | 0.632429 | 0.738997 | 0.681572 |
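- Performance is essentially unchanged after dropping the six high p-value variables (F1 of 0.682 versus 0.683 before), and the reduced model now converges.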
Converting coefficients to odds¶
- The coefficients of the logistic regression model are in terms of log(odds); to find the odds we have to take the exponential of the coefficients.
- Therefore, odds = exp(b)
- The percentage change in odds is given as % change in odds = (exp(b) - 1) * 100
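As a small worked example of the two formulas above, the sketch below applies them to two coefficients copied from the lg1 summary (rounded to four decimals, so the results can differ slightly from values computed on the unrounded parameters). The helper names are hypothetical.

```python
import math

def odds_ratio(coef):
    """Multiplicative change in odds for a one-unit increase: exp(b)."""
    return math.exp(coef)

def pct_change_in_odds(coef):
    """Percentage change in odds for a one-unit increase: (exp(b) - 1) * 100."""
    return (math.exp(coef) - 1.0) * 100.0

# Coefficients copied from the lg1 summary above (rounded).
coefs = {"lead_time": 0.0157, "repeated_guest": -2.7358}

for name, b in coefs.items():
    print(name, round(odds_ratio(b), 4), round(pct_change_in_odds(b), 2))
```

An odds ratio above 1 corresponds to a positive percentage change, and an odds ratio below 1 to a negative one.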
# converting coefficients to odds
odds = np.exp(lg1.params)
# adding the odds to a dataframe
pd.DataFrame(odds,X_train1.columns,columns=['odds']).T
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| odds | 0.0 | 1.11487 | 1.164107 | 1.114662 | 1.0426 | 0.202976 | 1.015834 | 1.572905 | 0.958414 | 0.064839 | 1.257046 | 1.019374 | 0.229933 | 1.178549 | 1.331035 | 0.700689 | 0.753265 | 0.478935 | 0.380337 | 0.2384 | 0.45262 | 0.16771 |
Percentage change in odds¶
# finding the percentage change
perc_change_odds = (np.exp(lg1.params)-1)*100
# adding the change_odds% to a dataframe
pd.DataFrame(perc_change_odds,X_train1.columns,columns=['Change_Odds%']).T
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Change_Odds% | -100.0 | 11.487012 | 16.410722 | 11.466158 | 4.259964 | -79.702424 | 1.58342 | 57.29054 | -4.158584 | -93.516077 | 25.704603 | 1.937389 | -77.00672 | 17.854886 | 33.103474 | -29.931089 | -24.673455 | -52.106542 | -61.966306 | -76.159988 | -54.737981 | -83.22899 |
- Number of adults: For each additional adult, the odds of cancellation increase by 11.49%, holding all other features constant (the same condition applies to every interpretation below).
- Number of children: For each additional child, the odds of cancellation increase by 16.41%.
- Number of weekend nights: For each additional weekend night booked, the odds of cancellation increase by 11.47%.
- Number of week nights: For each additional week night booked, the odds of cancellation increase by 4.26%.
- Required car parking space: The odds of cancellation decrease by 79.70% if the customer requires a car parking space.
- Lead time: For each unit increase in lead time, the odds of cancellation increase by 1.58%.
- Arrival year: For each unit increase in arrival year, the odds of cancellation increase by 57.29%.
- Arrival month: For each unit increase in arrival month, the odds of cancellation decrease by 4.16%.
- Repeated guest: If the customer is a repeated guest, the odds of cancellation decrease by 93.52%.
- Number of special requests: For each additional special request, the odds of cancellation decrease by 77.01%.
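The percentages above are per unit. Over several units they compound multiplicatively rather than adding up, because the odds multiplier for k units is exp(b * k). A short sketch with the rounded lead_time coefficient from the lg1 summary; the 60-unit horizon and the helper name are arbitrary choices for illustration.

```python
import math

def odds_multiplier(coef, units):
    """Multiplicative change in odds when the feature increases by `units`."""
    return math.exp(coef * units)

coef_lead_time = 0.0157  # copied from the lg1 summary above (rounded)
units = 60               # arbitrary horizon chosen for illustration

# Correct: compound the per-unit multiplier over all units.
compounded_pct = (odds_multiplier(coef_lead_time, units) - 1.0) * 100.0
# Naive: multiply the per-unit percentage by the number of units.
naive_pct = units * (math.exp(coef_lead_time) - 1.0) * 100.0

print(round(compounded_pct, 1), round(naive_pct, 1))
```

The compounded figure is noticeably larger than the naive one, so per-unit percentages should not simply be scaled up for long lead times.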
#Confusion matrix
confusion_matrix_statsmodels(lg1,X_train1.astype(float),y_train)
# Checking model performance
log_model_perf_train = model_performance_classification_statsmodels(lg1,X_train1.astype(float),y_train)
print("Training performance:")
log_model_perf_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805372 | 0.632429 | 0.738997 | 0.681572 |
ROC-AUC on training set
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1.astype(float)))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1.astype(float)))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Model Performance Improvement¶
- Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.
Optimal threshold using AUC-ROC curve¶
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1.astype(float)))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.369603791589294
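Taking the argmax of tpr - fpr, as done above, picks the threshold that maximizes Youden's J statistic. The sketch below reproduces that idea by hand on made-up toy scores. It uses score >= threshold to mark positives, which is the convention roc_curve uses, whereas the helper functions in this notebook classify with a strict >. The function name is hypothetical.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        tpr = np.sum(pred & y_true) / np.sum(y_true)
        fpr = np.sum(pred & ~y_true) / np.sum(~y_true)
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j

# Toy data, made up for illustration (1 = cancelled).
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.60, 0.70, 0.80, 0.90]

best_t, best_j = youden_threshold(y_true, scores)
print(best_t, round(best_j, 2))
```

This criterion weights true and false positive rates equally; if missed cancellations cost more than false alarms, a different criterion may be more appropriate.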
# creating confusion matrix
confusion_matrix_statsmodels(
lg1,X_train1.astype(float),y_train,optimal_threshold_auc_roc
)
#Checking model performance:
log_model_perf_Auc_roc_train = model_performance_classification_statsmodels(lg1,X_train1.astype(float),y_train,optimal_threshold_auc_roc)
print("Training performance:")
log_model_perf_Auc_roc_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.792572 | 0.736339 | 0.667896 | 0.700449 |
- F1 has improved slightly (0.68 to 0.70). Recall rose from 0.63 to 0.74, at the cost of precision falling from 0.74 to 0.67.
Let's use Precision-Recall curve and see if we can find a better threshold¶
y_scores = lg1.predict(X_train1.astype(float))
prec,rec,thre = precision_recall_curve(y_train,y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, thre)
plt.show()
# setting the threshold near the point where the precision and recall curves cross
optimal_threshold = 0.42
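The value 0.42 is read off the plot. One possible way to locate such a crossover numerically is to scan a grid of thresholds and keep the one where precision and recall are closest. The sketch below does this on synthetic scores (not the hotel data); the function name, the grid and the data-generating choices are all assumptions made for illustration.

```python
import numpy as np

def crossover_threshold(y_true, scores, grid):
    """Return (threshold, gap) where |precision - recall| is smallest on the grid."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_gap = None, np.inf
    for t in grid:
        pred = scores > t
        n_pred = pred.sum()
        if n_pred == 0:
            continue  # precision is undefined when nothing is predicted positive
        tp = np.sum(pred & y_true)
        precision = tp / n_pred
        recall = tp / y_true.sum()
        gap = abs(precision - recall)
        if gap < best_gap:
            best_t, best_gap = float(t), float(gap)
    return best_t, best_gap

# Synthetic labels and scores: positives tend to score higher than negatives.
rng = np.random.default_rng(1)
y_true = rng.random(2000) < 0.33
scores = np.clip(0.25 + 0.35 * y_true + rng.normal(0, 0.2, size=2000), 0, 1)

threshold, gap = crossover_threshold(y_true, scores, np.arange(0.05, 0.96, 0.01))
print(round(threshold, 2), round(gap, 3))
```

Precision equals recall exactly when the number of predicted positives equals the number of actual positives, which is why this crossover tends to give a balanced F1.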
Checking model performance on training set¶
# creating confusion matrix:
confusion_matrix_statsmodels(lg1,X_train1.astype(float),y_train,optimal_threshold)
#Checking model performance
log_model_perf_pre_rec_train = model_performance_classification_statsmodels(lg1,X_train1.astype(float),y_train,optimal_threshold)
print("Training performance:")
log_model_perf_pre_rec_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.801276 | 0.699271 | 0.697935 | 0.698602 |
Let's check the performance on the test set¶
#Creating Confusion matrix
confusion_matrix_statsmodels(lg1,X_test1.astype(float),y_test)
log_model_perf_test = model_performance_classification_statsmodels(lg1,X_test1.astype(float),y_test)
print("Test performance:")
log_model_perf_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.804649 | 0.630892 | 0.729003 | 0.676408 |
- ROC curve on test set
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test1.astype(float)))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1.astype(float)))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold=0.37
#Creating Confusion matrix
confusion_matrix_statsmodels(lg1,X_test1.astype(float),y_test,optimal_threshold_auc_roc)
#Checking model performance test set:
log_model_perf_Auc_roc_test = model_performance_classification_statsmodels(lg1,X_test1.astype(float),y_test,optimal_threshold_auc_roc)
print("Testing performance:")
log_model_perf_Auc_roc_test
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.795553 | 0.739637 | 0.66573 | 0.70074 |
Using model with threshold = 0.42
#Creating Confusion matrix
confusion_matrix_statsmodels(lg1,X_test1.astype(float),y_test,optimal_threshold)
#Checking model performance test set:
log_model_perf_pre_rec_test = model_performance_classification_statsmodels(lg1,X_test1.astype(float),y_test,optimal_threshold)
print("Testing performance:")
log_model_perf_pre_rec_test
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.803639 | 0.703861 | 0.693815 | 0.698802 |
Model performance summary¶
# training performance comparison
models_train_comp_df = pd.concat(
[
log_model_perf_train.T,
log_model_perf_Auc_roc_train.T,
log_model_perf_pre_rec_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.805372 | 0.792572 | 0.801276 |
| Recall | 0.632429 | 0.736339 | 0.699271 |
| Precision | 0.738997 | 0.667896 | 0.697935 |
| F1 | 0.681572 | 0.700449 | 0.698602 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_model_perf_test.T,
log_model_perf_Auc_roc_test.T,
log_model_perf_pre_rec_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.804649 | 0.795553 | 0.803639 |
| Recall | 0.630892 | 0.739637 | 0.703861 |
| Precision | 0.729003 | 0.665730 | 0.693815 |
| F1 | 0.676408 | 0.700740 | 0.698802 |
Final Model Summary¶
The model with threshold 0.42 balances recall and precision (both about 0.70 on the test set), giving an F1 score of 0.70.
All three logistic regression variants perform comparably on the training and test sets, which suggests the model generalizes and should perform similarly for INN Hotels in production.
Coefficients for the number of adults, the number of children, the lead time prior to arrival, the arrival year (i.e., 2018 v. 2017), the number of previous cancellations, the average room price, the number of weekend and week nights booked, selecting Meal Plan 2, and not selecting a meal plan are all positive, meaning an increase in these leads to an increase in the chances of a hotel booking being cancelled.
Coefficients for requiring a parking space, arrival month, being a repeat guest, the number of special requests, room types 2, 4, 5, 6 and 7, and the Corporate and Offline market segments are all negative, meaning an increase in these leads to a decrease in the chances of a hotel booking being cancelled.
Building a Decision Tree model¶
Decision Tree¶
Data Preparation for modeling (Decision Tree)¶
- We'll encode categorical features.
- We'll split data into train and test set
X = df.drop('booking_status',axis=1)
Y = df['booking_status']
X = pd.get_dummies(X,drop_first=True)
# splitting data into train and test
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.30,random_state=1)
print("Shape of Training set:",X_train.shape[0])
print("Shape of Test set:",X_test.shape[0])
Shape of Training set: 25392
Shape of Test set: 10883
print("Percentage of classes in Training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in Test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in Training set:
booking_status
0    0.670644
1    0.329356
Name: proportion, dtype: float64
Percentage of classes in Test set:
booking_status
0    0.676376
1    0.323624
Name: proportion, dtype: float64
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.¶
- The model_performance_classification_sklearn function will be used to check the model performance of models.
- The confusion_matrix_sklearn function will be used to plot the confusion matrix.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Model Building Decision Tree¶
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
Checking Model performance on training set:¶
# creating confusion matrix for training set:
confusion_matrix_sklearn(model,X_train,y_train)
# checking model performance on training data:
decision_model_perf_train = model_performance_classification_sklearn(model,X_train,y_train)
print("Training performance:")
decision_model_perf_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.994211 | 0.986608 | 0.995776 | 0.991171 |
- The model is overfitting the training data, scoring about 99% on every metric.
Checking Model performance on test set:¶
#creating confusion matrix for test:
confusion_matrix_sklearn(model,X_test,y_test)
#checking model performance for test data:
decision_model_perf_test = model_performance_classification_sklearn(model,X_test,y_test)
print("Testing performance:")
decision_model_perf_test
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.871175 | 0.811755 | 0.794608 | 0.80309 |
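- The model still gives good results on the test set (accuracy 0.87, F1 0.80), but the drop from the training scores confirms that the tree is overfitting.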
Let's check important features¶
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8,8))
plt.title('Feature Importances')
plt.barh(range(len(indices)),importances[indices],color='violet',align='center')
plt.yticks(range(len(indices)),[feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Lead time and average price per room are the most important features.¶
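For context, the feature_importances_ of a scikit-learn tree are based on the total impurity reduction each feature contributes across its splits, and the default criterion for DecisionTreeClassifier is Gini impurity. The sketch below shows that calculation for a single split on made-up class counts; the helper names are hypothetical, and the real attribute additionally sums over all splits per feature and normalizes.

```python
def gini(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    return gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)

# Made-up counts [not cancelled, cancelled] for one split.
parent = [50, 50]
left = [40, 10]
right = [10, 40]

decrease = impurity_decrease(parent, left, right)
print(round(gini(parent), 2), round(decrease, 2))
```

A feature that is used in many splits with large decreases, such as lead time here, ends up with a high importance.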
Do we need to prune the tree?¶
Pruning the Tree¶
Pre-Pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
Checking performance on Training Set¶
#confusion matrix train set:
confusion_matrix_sklearn(estimator,X_train,y_train)
#Checking performance for training data:
decision_model_tune_perf_train = model_performance_classification_sklearn(estimator,X_train,y_train)
print("Training performance:")
decision_model_tune_perf_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83097 | 0.786082 | 0.724248 | 0.753899 |
Checking performance on Test set¶
confusion_matrix_sklearn(estimator,X_test,y_test)
decision_model_tune_perf_test = model_performance_classification_sklearn(estimator,X_test,y_test)
print("Testing performance:")
decision_model_tune_perf_test
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.834972 | 0.783362 | 0.727584 | 0.754444 |
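After hyperparameter tuning, test performance has gone down (F1 of 0.75 versus 0.80 for the unpruned tree), but the training and test scores are now almost identical, so the pruned tree no longer overfits.¶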
Visualizing Decision Tree¶
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
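Because the tuned estimator was built with class_weight="balanced", the weights shown in a text report of this tree are weighted sums rather than raw booking counts. scikit-learn computes balanced weights as n_samples / (n_classes * class count). The sketch below applies that formula to class counts that are derived here from the reported training proportions (25392 rows, about 67.06% in class 0); the counts and the helper name are assumptions, not values printed by the notebook.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Counts derived from the reported training split (assumption, see lead-in).
train_counts = {0: 17029, 1: 8363}
weights = balanced_class_weights(train_counts)
print({label: round(w, 4) for label, w in weights.items()})

# A hypothetical leaf with 1 class-0 booking and 16 class-1 bookings
# would be reported with these weighted totals.
leaf = [1 * weights[0], 16 * weights[1]]
print([round(v, 2) for v in leaf])
```

Under this weighting each cancelled booking counts roughly twice as much as a non-cancelled one, which is consistent with fractional leaf values such as 0.75 and 1.52.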
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room > 196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time > 102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time > 159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time > 348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month > 
11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
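Each leaf in the text report above ends with `weights: [w0, w1] class: c`. As a reading aid, the sketch below parses such leaf entries and computes the weighted share of class 1 in each leaf. The helper name `leaf_shares` is hypothetical, and the code assumes that class 1 encodes a canceled booking. The fractional values suggest the counts are class-weighted, so the share is a weighted proportion rather than a raw cancellation rate.

```python
import re

# Matches leaf entries such as "weights: [850.67, 3543.28] class: 1"
LEAF_RE = re.compile(r"weights: \[([\d.]+), ([\d.]+)\] class: (\d)")


def leaf_shares(report):
    """Return (w0, w1, predicted_class, weighted_share_of_class_1) per leaf."""
    rows = []
    for w0, w1, cls in LEAF_RE.findall(report):
        w0, w1 = float(w0), float(w1)
        rows.append((w0, w1, int(cls), w1 / (w0 + w1)))
    return rows


# Two leaves copied from the report above
sample = (
    "|--- weights: [850.67, 3543.28] class: 1 "
    "|--- weights: [48.46, 1.52] class: 0"
)

rows = leaf_shares(sample)
for w0, w1, cls, share in rows:
    print(f"weights=[{w0}, {w1}] class={cls} weighted share of class 1={share:.3f}")
```

The predicted class is simply the side with the larger weight, so a leaf such as `[158.80, 159.40]` is labeled class 1 even though it is close to an even split.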
# Feature importances of the tree
importances = estimator.feature_importances_
# argsort is ascending, so the most important feature ends up at the top of the bar chart
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
- Lead time is still the most important feature, and market_segment_type_Online is more important than the average price per room.
Cost Complexity Pruning
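clf = DecisionTreeClassifier(random_state=1, class_weight='balanced')
path = clf.cost_complexity_pruning_path(X_train, y_train)
# abs() guards against tiny negative alphas caused by floating-point error; ccp_alpha must be non-negative
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)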
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | 0.008376 |
| 1 | 0.000000e+00 | 0.008376 |
| 2 | 2.933821e-20 | 0.008376 |
| 3 | 2.933821e-20 | 0.008376 |
| 4 | 2.933821e-20 | 0.008376 |
| ... | ... | ... |
| 1853 | 8.901596e-03 | 0.328058 |
| 1854 | 9.802243e-03 | 0.337860 |
| 1855 | 1.271875e-02 | 0.350579 |
| 1856 | 3.412090e-02 | 0.418821 |
| 1857 | 8.117914e-02 | 0.500000 |
1858 rows × 2 columns
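The last row of the path reports a total impurity of 0.5. A plausible explanation, given the default Gini criterion and `class_weight='balanced'`: balanced weights give both classes the same total weight, so the fully pruned one-node tree has Gini impurity 1 - 0.5² - 0.5² = 0.5. The sketch below checks this with hypothetical class counts (the real counts are not shown here); `balanced_weights` follows the formula n_samples / (n_classes * class_count) that scikit-learn documents for the balanced mode.

```python
def balanced_weights(counts):
    """Per-class weight: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]


def gini(weighted_counts):
    """Gini impurity of a node from its (weighted) class counts."""
    total = sum(weighted_counts)
    return 1.0 - sum((w / total) ** 2 for w in weighted_counts)


counts = [670, 330]  # hypothetical: not-canceled vs canceled bookings
weights = balanced_weights(counts)
weighted = [c * w for c, w in zip(counts, weights)]

print("class weights:", [round(w, 4) for w in weights])
print("weighted class totals:", weighted)
print("Gini of the root node:", gini(weighted))
```

Whatever the raw class ratio, the weighted totals come out equal, which is why the impurity of the single-node tree does not depend on how imbalanced the data is.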
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1],impurities[:-1],marker='o',drawstyle='steps-post')
ax.set_xlabel('effective alpha')
ax.set_ylabel('total impurity of leaves')
ax.set_title('Total Impurity vs effective alpha for training set')
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight='balanced')
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha 0.08117914389136915
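To make the role of the effective alphas more concrete: in minimal cost-complexity pruning, each internal node t has an effective alpha (R(t) - R(T_t)) / (|T_t| - 1), where R(t) is the impurity of t treated as a leaf, R(T_t) is the total impurity of the leaves below t, and |T_t| is the number of those leaves. The node with the smallest effective alpha is the weakest link and is pruned first. The sketch below applies the formula to three hypothetical nodes; the node names and numbers are made up for illustration.

```python
def effective_alpha(node_impurity, subtree_impurity, n_leaves):
    """alpha_eff(t) = (R(t) - R(T_t)) / (|T_t| - 1)."""
    if n_leaves < 2:
        raise ValueError("an internal node has at least two leaves below it")
    return (node_impurity - subtree_impurity) / (n_leaves - 1)


# Hypothetical internal nodes: name -> (R(t), R(T_t), number of leaves in T_t)
nodes = {
    "A": (0.20, 0.10, 5),
    "B": (0.08, 0.06, 2),
    "C": (0.05, 0.02, 4),
}

alphas = {name: effective_alpha(*values) for name, values in nodes.items()}
weakest = min(alphas, key=alphas.get)

for name, alpha in alphas.items():
    print(f"node {name}: effective alpha = {alpha:.4f}")
print("weakest link (pruned first):", weakest)
```

Repeating this step until only the root remains yields the increasing sequence of alphas, which is why the largest alpha corresponds to a tree with a single node.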
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and the tree depth decrease as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
F1 Score vs alpha for training and test set
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig,ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("Alpha vs F1 Score for train and test set")
ax.plot(ccp_alphas,f1_train,marker='o',label='train',drawstyle='steps-post')
ax.plot(ccp_alphas,f1_test,marker='o',label='test',drawstyle='steps-post')
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167048,
class_weight='balanced', random_state=1)
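A note on the selection step: `np.argmax` returns the first maximum, so ties go to the smallest alpha, that is, the largest tree. Also, because the alpha is picked by test-set F1, the test metrics reported for this model are likely somewhat optimistic; scoring the candidates on a separate validation split would avoid that. The sketch below shows one possible selection rule that prefers the largest alpha among near-best candidates. The function `pick_alpha`, the tolerance, and the scores are all hypothetical.

```python
def pick_alpha(alphas, scores, tol=0.0):
    """Return (index, alpha) of the chosen candidate.

    Among candidates whose score is within `tol` of the best score,
    prefer the largest alpha, i.e. the most heavily pruned tree.
    """
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s >= best - tol]
    idx = max(candidates, key=lambda i: alphas[i])
    return idx, alphas[idx]


# Hypothetical alphas and validation-set F1 scores
alphas = [0.0, 0.0001, 0.0005, 0.001, 0.01]
val_f1 = [0.790, 0.805, 0.809, 0.807, 0.700]

strict = pick_alpha(alphas, val_f1)              # best score only
relaxed = pick_alpha(alphas, val_f1, tol=0.005)  # accept near-best, simpler trees

print("strict choice:", strict)
print("relaxed choice:", relaxed)
```

With a small tolerance the rule trades a negligible amount of F1 for a smaller tree, which tends to be easier to read and explain.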
Checking performance on training set¶
confusion_matrix_sklearn(best_model,X_train,y_train)
decision_model_post_perf_train = model_performance_classification_sklearn(best_model,X_train,y_train)
decision_model_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.899929 | 0.902906 | 0.813685 | 0.855977 |
Checking performance on test set¶
confusion_matrix_sklearn(best_model,X_test,y_test)
decision_model_post_perf_test = model_performance_classification_sklearn(best_model,X_test,y_test)
decision_model_post_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.869429 | 0.856616 | 0.767099 | 0.80939 |
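As a quick sanity check on the two tables above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported values. The sketch below does that and also prints the train-minus-test gap per metric, which gives a rough sense of how much the pruned tree still overfits. The dictionaries copy the numbers from the tables; the helper `f1_from` is just for illustration.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the performance tables for the post-pruned tree
train = {"Accuracy": 0.899929, "Recall": 0.902906, "Precision": 0.813685, "F1": 0.855977}
test = {"Accuracy": 0.869429, "Recall": 0.856616, "Precision": 0.767099, "F1": 0.80939}

for name, metrics in (("train", train), ("test", test)):
    recomputed = f1_from(metrics["Precision"], metrics["Recall"])
    print(f"{name}: reported F1={metrics['F1']}, recomputed F1={recomputed:.6f}")

# Train-minus-test gap per metric
gaps = {key: round(train[key] - test[key], 6) for key in train}
print(gaps)
```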
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Draw every split arrow in black so it stays visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the post-pruned decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50 | |--- no_of_special_requests <= 0.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_weekend_nights <= 0.50 | | | | | |--- avg_price_per_room <= 196.50 | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | |--- lead_time <= 16.50 | | | | | | | | |--- avg_price_per_room <= 68.50 | | | | | | | | | |--- weights: [207.26, 10.63] class: 0 | | | | | | | | |--- avg_price_per_room > 68.50 | | | | | | | | | |--- arrival_date <= 29.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- arrival_date > 29.50 | | | | | | | | | | |--- weights: [2.24, 7.59] class: 1 | | | | | | | |--- lead_time > 16.50 | | | | | | | | |--- avg_price_per_room <= 135.00 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- no_of_previous_bookings_not_canceled <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_previous_bookings_not_canceled > 0.50 | | | | | | | | | | | |--- weights: [11.18, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [21.62, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 135.00 | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | |--- weights: [1199.59, 1.52] class: 0 | | | | | |--- avg_price_per_room > 196.50 | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | |--- no_of_weekend_nights > 0.50 | | | | | |--- lead_time <= 68.50 | | | | | | |--- arrival_month <= 9.50 | | | | | | | |--- avg_price_per_room <= 63.29 | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- weights: [41.75, 0.00] class: 0 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | 
| | | | |--- weights: [0.75, 3.04] class: 1 | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | |--- avg_price_per_room <= 59.75 | | | | | | | | | | |--- arrival_date <= 23.50 | | | | | | | | | | | |--- weights: [1.49, 12.14] class: 1 | | | | | | | | | | |--- arrival_date > 23.50 | | | | | | | | | | | |--- weights: [14.91, 1.52] class: 0 | | | | | | | | | |--- avg_price_per_room > 59.75 | | | | | | | | | | |--- lead_time <= 44.00 | | | | | | | | | | | |--- weights: [0.75, 59.21] class: 1 | | | | | | | | | | |--- lead_time > 44.00 | | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 63.29 | | | | | | | | |--- no_of_weekend_nights <= 3.50 | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | | | |--- weights: [20.13, 0.00] class: 0 | | | | | | | | |--- no_of_weekend_nights > 3.50 | | | | | | | | | |--- weights: [0.75, 15.18] class: 1 | | | | | | |--- arrival_month > 9.50 | | | | | | | |--- weights: [413.04, 27.33] class: 0 | | | | | |--- lead_time > 68.50 | | | | | | |--- avg_price_per_room <= 99.98 | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | |--- avg_price_per_room <= 62.50 | | | | | | | | | |--- weights: [15.66, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 62.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- lead_time <= 81.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 81.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [3.73, 
0.00] class: 0 | | | | | | | |--- arrival_month > 3.50 | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | |--- weights: [55.17, 3.04] class: 0 | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | |--- lead_time <= 73.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- lead_time > 73.50 | | | | | | | | | | |--- weights: [21.62, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 99.98 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- avg_price_per_room <= 132.43 | | | | | | | | | |--- weights: [9.69, 122.97] class: 1 | | | | | | | | |--- avg_price_per_room > 132.43 | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- lead_time <= 117.50 | | | | | |--- avg_price_per_room <= 93.58 | | | | | | |--- avg_price_per_room <= 75.07 | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | |--- avg_price_per_room <= 58.75 | | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 58.75 | | | | | | | | | |--- no_of_previous_cancellations <= 0.50 | | | | | | | | | | |--- arrival_month <= 4.50 | | | | | | | | | | | |--- weights: [2.24, 118.41] class: 1 | | | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- no_of_previous_cancellations > 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | |--- arrival_date <= 11.50 | | | | | | | | | |--- weights: [31.31, 0.00] class: 0 | | | | | | | | |--- arrival_date > 11.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- weights: [23.11, 6.07] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [5.96, 9.11] class: 1 | | | | | | |--- avg_price_per_room > 75.07 | | | | | | | |--- arrival_month <= 
3.50 | | | | | | | | |--- weights: [59.64, 3.04] class: 0 | | | | | | | |--- arrival_month > 3.50 | | | | | | | | |--- arrival_month <= 4.50 | | | | | | | | | |--- weights: [1.49, 16.70] class: 1 | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- avg_price_per_room <= 86.00 | | | | | | | | | | | |--- weights: [2.24, 16.70] class: 1 | | | | | | | | | | |--- avg_price_per_room > 86.00 | | | | | | | | | | | |--- weights: [8.95, 3.04] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [44.73, 4.55] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | |--- avg_price_per_room > 93.58 | | | | | | |--- arrival_date <= 11.50 | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | |--- weights: [16.40, 39.47] class: 1 | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | |--- weights: [20.13, 6.07] class: 0 | | | | | | |--- arrival_date > 11.50 | | | | | | | |--- avg_price_per_room <= 102.09 | | | | | | | | |--- weights: [5.22, 144.22] class: 1 | | | | | | | |--- avg_price_per_room > 102.09 | | | | | | | | |--- avg_price_per_room <= 109.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [0.75, 16.70] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- weights: [33.55, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 109.50 | | | | | | | | | |--- avg_price_per_room <= 124.25 | | | | | | | | | | |--- weights: [2.98, 75.91] class: 1 | | | | | | | | | |--- avg_price_per_room > 124.25 | | | | | | | | | | |--- weights: [3.73, 3.04] class: 0 | | | | |--- lead_time > 117.50 | | | | | |--- no_of_week_nights <= 1.50 | | | | | | |--- arrival_date <= 7.50 | | | | | | | |--- weights: [38.02, 0.00] class: 0 | | | | | | |--- arrival_date > 7.50 | | | | | | | |--- avg_price_per_room <= 93.58 | | | | | 
| | | |--- avg_price_per_room <= 65.38 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- avg_price_per_room > 65.38 | | | | | | | | | |--- weights: [24.60, 3.04] class: 0 | | | | | | | |--- avg_price_per_room > 93.58 | | | | | | | | |--- arrival_date <= 28.00 | | | | | | | | | |--- weights: [14.91, 72.87] class: 1 | | | | | | | | |--- arrival_date > 28.00 | | | | | | | | | |--- weights: [9.69, 1.52] class: 0 | | | | | |--- no_of_week_nights > 1.50 | | | | | | |--- no_of_adults <= 1.50 | | | | | | | |--- weights: [84.25, 0.00] class: 0 | | | | | | |--- no_of_adults > 1.50 | | | | | | | |--- lead_time <= 125.50 | | | | | | | | |--- avg_price_per_room <= 90.85 | | | | | | | | | |--- avg_price_per_room <= 87.50 | | | | | | | | | | |--- weights: [13.42, 13.66] class: 1 | | | | | | | | | |--- avg_price_per_room > 87.50 | | | | | | | | | | |--- weights: [0.00, 15.18] class: 1 | | | | | | | | |--- avg_price_per_room > 90.85 | | | | | | | | | |--- weights: [10.44, 0.00] class: 0 | | | | | | | |--- lead_time > 125.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- weights: [58.15, 18.22] class: 0 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- weights: [61.88, 1.52] class: 0 | | |--- market_segment_type_Online > 0.50 | | | |--- lead_time <= 13.50 | | | | |--- avg_price_per_room <= 99.44 | | | | | |--- arrival_month <= 1.50 | | | | | | |--- weights: [92.45, 0.00] class: 0 | | | | | |--- arrival_month > 1.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- avg_price_per_room <= 70.05 | | | | | | | | | |--- weights: [31.31, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 70.05 | | | | | | | | | |--- lead_time <= 5.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [38.77, 1.52] class: 0 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 
5.50 | | | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | | | |--- weights: [34.30, 40.99] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | |--- avg_price_per_room <= 74.21 | | | | | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | | | | | | |--- avg_price_per_room > 74.21 | | | | | | | | | | | |--- weights: [9.69, 0.00] class: 0 | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | |--- weights: [4.47, 10.63] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | |--- weights: [155.07, 6.07] class: 0 | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- weights: [3.73, 10.63] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | |--- avg_price_per_room > 99.44 | | | | | |--- lead_time <= 3.50 | | | | | | |--- avg_price_per_room <= 202.67 | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- weights: [63.37, 30.36] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | | |--- weights: [115.56, 12.14] class: 0 | | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | | |--- weights: [28.33, 3.04] class: 0 | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | |--- avg_price_per_room > 202.67 | | | | | | | |--- weights: [0.75, 22.77] class: 1 | | | | | 
|--- lead_time > 3.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- avg_price_per_room <= 119.25 | | | | | | | | |--- avg_price_per_room <= 118.50 | | | | | | | | | |--- weights: [18.64, 59.21] class: 1 | | | | | | | | |--- avg_price_per_room > 118.50 | | | | | | | | | |--- weights: [8.20, 1.52] class: 0 | | | | | | | |--- avg_price_per_room > 119.25 | | | | | | | | |--- weights: [34.30, 171.55] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [26.09, 1.52] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 14.00 | | | | | | | | | | |--- weights: [9.69, 36.43] class: 1 | | | | | | | | | |--- arrival_date > 14.00 | | | | | | | | | | |--- avg_price_per_room <= 208.67 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 208.67 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [15.66, 0.00] class: 0 | | | |--- lead_time > 13.50 | | | | |--- required_car_parking_space <= 0.50 | | | | | |--- avg_price_per_room <= 71.92 | | | | | | |--- avg_price_per_room <= 59.43 | | | | | | | |--- lead_time <= 84.50 | | | | | | | | |--- weights: [50.70, 7.59] class: 0 | | | | | | | |--- lead_time > 84.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_date <= 27.00 | | | | | | | | | | |--- lead_time <= 131.50 | | | | | | | | | | | |--- weights: [0.75, 15.18] class: 1 | | | | | | | | | | |--- lead_time > 131.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 27.00 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- weights: [10.44, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 59.43 | | | | | | | |--- lead_time <= 25.50 | | | | | | | | |--- 
weights: [20.88, 6.07] class: 0 | | | | | | | |--- lead_time > 25.50 | | | | | | | | |--- avg_price_per_room <= 71.34 | | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | | |--- lead_time <= 68.50 | | | | | | | | | | | |--- weights: [15.66, 78.94] class: 1 | | | | | | | | | | |--- lead_time > 68.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | | |--- lead_time <= 102.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 102.00 | | | | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 71.34 | | | | | | | | | |--- weights: [11.18, 0.00] class: 0 | | | | | |--- avg_price_per_room > 71.92 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- avg_price_per_room <= 120.45 | | | | | | | | | |--- weights: [79.77, 9.11] class: 0 | | | | | | | | |--- avg_price_per_room > 120.45 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- weights: [3.73, 12.14] class: 1 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- weights: [16.40, 47.06] class: 1 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- weights: [0.00, 63.76] class: 1 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- avg_price_per_room <= 104.31 | | | | | | | | |--- lead_time <= 25.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [16.40, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- weights: [38.77, 118.41] class: 1 
| | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [23.11, 0.00] class: 0 | | | | | | | | |--- lead_time > 25.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- weights: [39.51, 185.21] class: 1 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- weights: [73.81, 411.41] class: 1 | | | | | | | |--- avg_price_per_room > 104.31 | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 195.30 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- avg_price_per_room > 195.30 | | | | | | | | | | | |--- weights: [0.75, 138.15] class: 1 | | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [11.18, 6.07] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- weights: [0.75, 9.11] class: 1 | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | |--- avg_price_per_room <= 168.06 | | | | | | | | | | |--- lead_time <= 22.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 22.00 | | | | | | | | | | | |--- weights: [17.15, 83.50] class: 1 | | | | | | | | | |--- avg_price_per_room > 168.06 | | | | | | | | | | |--- weights: [12.67, 6.07] class: 0 | | | | |--- required_car_parking_space > 0.50 | | | | | |--- weights: [48.46, 1.52] class: 0 | |--- no_of_special_requests > 0.50 | | |--- no_of_special_requests <= 1.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- lead_time <= 102.50 | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | |--- weights: [697.09, 9.11] class: 0 | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | 
| | | | | |--- lead_time <= 63.00 | | | | | | | |--- weights: [15.66, 1.52] class: 0 | | | | | | |--- lead_time > 63.00 | | | | | | | |--- weights: [0.00, 7.59] class: 1 | | | | |--- lead_time > 102.50 | | | | | |--- no_of_week_nights <= 2.50 | | | | | | |--- lead_time <= 105.00 | | | | | | | |--- weights: [0.75, 6.07] class: 1 | | | | | | |--- lead_time > 105.00 | | | | | | | |--- weights: [31.31, 13.66] class: 0 | | | | | |--- no_of_week_nights > 2.50 | | | | | | |--- weights: [44.73, 3.04] class: 0 | | | |--- market_segment_type_Online > 0.50 | | | | |--- lead_time <= 8.50 | | | | | |--- lead_time <= 4.50 | | | | | | |--- no_of_week_nights <= 10.00 | | | | | | | |--- weights: [498.03, 40.99] class: 0 | | | | | | |--- no_of_week_nights > 10.00 | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | |--- lead_time > 4.50 | | | | | | |--- arrival_date <= 13.50 | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | |--- weights: [58.90, 36.43] class: 0 | | | | | | | |--- arrival_month > 9.50 | | | | | | | | |--- weights: [33.55, 1.52] class: 0 | | | | | | |--- arrival_date > 13.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- weights: [123.76, 9.11] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- avg_price_per_room <= 126.33 | | | | | | | | | |--- weights: [32.80, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 126.33 | | | | | | | | | |--- weights: [9.69, 13.66] class: 1 | | | | |--- lead_time > 8.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 118.55 | | | | | | | |--- lead_time <= 61.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | |--- weights: [70.08, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | 
| | | | | | | |--- truncated branch of depth 6 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [126.74, 1.52] class: 0 | | | | | | | |--- lead_time > 61.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | |--- weights: [4.47, 57.69] class: 1 | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | |--- lead_time <= 66.50 | | | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 66.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- avg_price_per_room <= 71.93 | | | | | | | | | | | |--- weights: [54.43, 3.04] class: 0 | | | | | | | | | | |--- avg_price_per_room > 71.93 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | |--- avg_price_per_room > 118.55 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | | | | |--- avg_price_per_room <= 177.15 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 177.15 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- avg_price_per_room <= 121.20 | | | | | | | | | | | |--- weights: [18.64, 6.07] class: 0 | | | | | | | | | | |--- avg_price_per_room > 121.20 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_date > 27.50 | | | | | 
| | | | | |--- lead_time <= 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- weights: [11.93, 10.63] class: 0 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- weights: [37.28, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- avg_price_per_room <= 119.20 | | | | | | | | | | | |--- weights: [9.69, 28.84] class: 1 | | | | | | | | | | |--- avg_price_per_room > 119.20 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 100.00 | | | | | | | | | | | |--- weights: [49.95, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 100.00 | | | | | | | | | | | |--- weights: [0.75, 18.22] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [134.20, 1.52] class: 0 | | |--- no_of_special_requests > 1.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [1585.04, 0.00] class: 0 | | | | |--- no_of_week_nights > 3.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | |--- lead_time <= 6.50 | | | | | | | | |--- weights: [32.06, 0.00] class: 0 | | | | | | | |--- lead_time > 6.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 5.50 | | | | | | | | | | |--- weights: [23.11, 1.52] class: 0 | | | | | | | | | |--- arrival_date > 5.50 | | | | | | | | | | |--- avg_price_per_room <= 93.09 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 93.09 | | | | | | | | | | | |--- weights: [77.54, 27.33] class: 0 | | | | | | | | |--- arrival_month > 11.50 
| | | | | | | | | |--- weights: [19.38, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 9.50 | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [52.19, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- no_of_special_requests <= 2.50 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- avg_price_per_room <= 202.95 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | |--- weights: [1.49, 9.11] class: 1 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- weights: [175.20, 28.84] class: 0 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | |--- avg_price_per_room > 202.95 | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | |--- arrival_month > 8.50 | | | | | | |--- avg_price_per_room <= 153.15 | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50 | | | | | | | | |--- avg_price_per_room <= 71.12 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 71.12 | | | | | | | | | |--- avg_price_per_room <= 90.42 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- weights: [12.67, 7.59] class: 0 | | | | | | | | | |--- avg_price_per_room > 90.42 | | | | | | | | | | |--- weights: [64.12, 60.72] class: 0 | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 153.15 | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | |--- no_of_special_requests > 2.50 | | | | | |--- weights: [67.10, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.04 | | |--- 
no_of_special_requests <= 0.50 | | | |--- no_of_adults <= 1.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- lead_time <= 163.50 | | | | | | |--- arrival_month <= 5.00 | | | | | | | |--- weights: [2.98, 0.00] class: 0 | | | | | | |--- arrival_month > 5.00 | | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | | |--- lead_time > 163.50 | | | | | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | |--- weights: [46.97, 9.11] class: 0 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | | | |--- weights: [0.00, 13.66] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- weights: [188.62, 7.59] class: 0 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- weights: [13.42, 27.33] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 2.50 | | | | | | |--- lead_time <= 285.50 | | | | | | | |--- weights: [8.20, 0.00] class: 0 | | | | | | |--- lead_time > 285.50 | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | |--- avg_price_per_room > 2.50 | | | | | | |--- weights: [0.75, 97.16] class: 1 | | | |--- no_of_adults > 1.50 | | | | |--- avg_price_per_room <= 82.47 | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | |--- weights: [2.98, 282.37] class: 1 | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- lead_time <= 244.00 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- no_of_weekend_nights 
<= 1.50 | | | | | | | | | | |--- lead_time <= 166.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 166.50 | | | | | | | | | | | |--- weights: [2.24, 57.69] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [17.89, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- weights: [11.18, 3.04] class: 0 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [75.30, 12.14] class: 0 | | | | | | | |--- lead_time > 244.00 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [25.35, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [11.18, 264.15] class: 1 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [46.22, 0.00] class: 0 | | | | |--- avg_price_per_room > 82.47 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- lead_time <= 324.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- weights: [7.46, 986.78] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | 
| | |--- market_segment_type_Online <= 0.50 | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | |--- lead_time > 324.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- weights: [0.75, 13.66] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [5.22, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- lead_time <= 159.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [1.49, 7.59] class: 1 | | | | | |--- lead_time > 159.50 | | | | | | |--- arrival_date <= 1.50 | | | | | | | |--- weights: [1.49, 3.04] class: 1 | | | | | | |--- arrival_date > 1.50 | | | | | | | |--- weights: [35.79, 1.52] class: 0 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- avg_price_per_room <= 96.37 | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | |--- avg_price_per_room > 96.37 | | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- weights: [7.46, 206.46] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | |--- weights: [46.97, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | |--- no_of_week_nights <= 6.50 | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | 
| |--- lead_time <= 152.50 | | | | | | | | | | | |--- weights: [1.49, 4.55] class: 1 | | | | | | | | | | |--- lead_time > 152.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | |--- weights: [23.11, 19.74] class: 0 | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [2.24, 15.18] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | |--- no_of_week_nights > 6.50 | | | | | | | | |--- weights: [4.47, 13.66] class: 1 | | | | | |--- arrival_month > 11.50 | | | | | | |--- arrival_date <= 14.50 | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | |--- arrival_date > 14.50 | | | | | | | |--- weights: [11.18, 31.88] class: 1 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- weights: [106.61, 3.04] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- weights: [5.96, 4.55] class: 0 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3200.19] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [23.11, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
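# Feature importances of the selected tree (feature_names is the list of training columns)
importances = best_model.feature_importances_
# Sort ascending so the most important feature is drawn at the top of the horizontal bar chart
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()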
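A bar chart like this can be hard to read when there are many one-hot encoded columns, so a ranked text listing is a useful complement. The helper below is a minimal sketch: `rank_importances` is a hypothetical name, and the feature names and scores are placeholders, not values from the fitted model. With the real model, `feature_names` and `best_model.feature_importances_` would be passed in instead.

```python
def rank_importances(feature_names, importances, top_k=5):
    """Return the top_k (name, importance) pairs, largest importance first."""
    pairs = sorted(
        zip(feature_names, importances), key=lambda pair: pair[1], reverse=True
    )
    return pairs[:top_k]


# Placeholder inputs for illustration only; these are NOT the model's importances.
example_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
example_importances = [0.15, 0.40, 0.20, 0.25]

for name, score in rank_importances(example_names, example_importances, top_k=3):
    print(f"{name:<12s}{score:.3f}")
```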
Comparing Decision Tree Models¶
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_model_perf_train.T,
decision_model_tune_perf_train.T,
decision_model_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.994211 | 0.830970 | 0.899929 |
| Recall | 0.986608 | 0.786082 | 0.902906 |
| Precision | 0.995776 | 0.724248 | 0.813685 |
| F1 | 0.991171 | 0.753899 | 0.855977 |
# Test performance comparison
models_test_comp_df = pd.concat(
[
decision_model_perf_test.T,
decision_model_tune_perf_test.T,
decision_model_post_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.871175 | 0.834972 | 0.869429 |
| Recall | 0.811755 | 0.783362 | 0.856616 |
| Precision | 0.794608 | 0.727584 | 0.767099 |
| F1 | 0.803090 | 0.754444 | 0.809390 |
- Both the decision tree and the logistic regression model perform well after tuning, but the decision tree gives better results.
- Post-pruning gives the best decision tree: on the test set it has the highest recall (0.857) and F1 (0.809), and its train-test gap in F1 (0.856 vs. 0.809) is far smaller than that of the default tree (0.991 vs. 0.803), which overfits.
- Judged by F1 score, the decision tree outperforms the logistic regression model.
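The overfitting comparison can be made explicit by subtracting test scores from training scores. The sketch below copies the Recall and F1 values from the two comparison tables into plain dictionaries; `generalization_gap` is a hypothetical helper name introduced only for this illustration.

```python
# Recall and F1 copied from the training and test comparison tables.
train = {
    "Decision Tree sklearn": {"Recall": 0.986608, "F1": 0.991171},
    "Decision Tree (Pre-Pruning)": {"Recall": 0.786082, "F1": 0.753899},
    "Decision Tree (Post-Pruning)": {"Recall": 0.902906, "F1": 0.855977},
}
test = {
    "Decision Tree sklearn": {"Recall": 0.811755, "F1": 0.803090},
    "Decision Tree (Pre-Pruning)": {"Recall": 0.783362, "F1": 0.754444},
    "Decision Tree (Post-Pruning)": {"Recall": 0.856616, "F1": 0.809390},
}


def generalization_gap(train_scores, test_scores, metric):
    """Train minus test score per model; a large positive gap suggests overfitting."""
    return {
        model: round(train_scores[model][metric] - test_scores[model][metric], 6)
        for model in train_scores
    }


f1_gap = generalization_gap(train, test, "F1")
best_test_f1 = max(test, key=lambda model: test[model]["F1"])

for model, gap in f1_gap.items():
    print(f"{model:<30s} F1 gap: {gap:+.6f}")
print("Highest test F1:", best_test_f1)
```

The default tree shows by far the largest gap, the pre-pruned tree shows almost none, and the post-pruned tree sits in between while scoring highest on the test set.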
Actionable Insights and Recommendations¶
- Offer free cancellations if made a specific number of days before check-in (e.g., 14 days or more). This encourages early planning and gives the hotel time to rebook the room.
- Allow cancellations closer to the check-in date (e.g., 7-13 days) with a 50% refund.
- For cancellations made within 7 days of check-in, provide no refund. Instead, offer credit for that amount (valid for 12 months) toward a future booking, which encourages rebooking.
- During peak season, charge higher fees for cancellations close to the check-in date.
- During the off season, be more lenient and offer attractive packages.
- Create a loyalty program where members get more lenient cancellation policies. For example:
  - Standard customers: Strict cancellation fees.
  - Premium members: Flexible or free cancellation options.
- Offer repeat customers perks such as room upgrades and complimentary services (e.g., spa, meals), along with reward points that encourage future bookings.
- Accommodate requests for free parking space and other special requests where possible to help prevent cancellations.
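The tiered refund schedule recommended above (full refund at 14 or more days, 50% at 7-13 days, credit only inside 7 days) could be encoded as a small rule. This is a sketch under stated assumptions, not a finalized policy: `refund_policy` is a hypothetical function name, the thresholds are the example values from the recommendations, and day 7 is assumed to fall in the 50% tier.

```python
def refund_policy(days_before_checkin, amount_paid):
    """Return (cash_refund, future_credit) for a cancelled booking.

    Thresholds are the example values from the recommendations:
    14 or more days -> full refund; 7-13 days -> 50% refund;
    fewer than 7 days -> no refund, full amount as future-booking credit.
    """
    if days_before_checkin < 0:
        raise ValueError("days_before_checkin must be non-negative")
    if days_before_checkin >= 14:
        return amount_paid, 0.0
    if days_before_checkin >= 7:  # assumption: day 7 belongs to the 50% tier
        return 0.5 * amount_paid, 0.0
    return 0.0, amount_paid  # credit assumed valid for 12 months


for days in (20, 10, 3):
    refund, credit = refund_policy(days, 200.0)
    print(f"{days:>2d} days before check-in: refund={refund:.2f}, credit={credit:.2f}")
```

Peak-season surcharges or loyalty-tier exceptions could be layered on by passing extra arguments, but those rules are left out here because the recommendations do not fix their values.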